
Neural Information Processing Systems

A. Ablation Study. In this ablation study, we further investigate the power of the policies searched by our approach and by the closely related method AutoAug [1]. We then gradually remove the most important operations from the searched policy one by one and investigate the change in the Top-1 test error rates, as reported in Tab. 1. Figure 1: We investigate the key hyper-parameter N_late by visualizing the difference it brings to the search dynamics; we vary the number of epochs in the late training stage (N_late). By adjusting N_late we can still maintain the reliability of policy evaluation to a large extent.
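The remove-and-retest loop described above can be sketched as follows. This is a minimal illustration, not the study's actual code: `evaluate` and `importance` are hypothetical stand-ins for retraining the model and ranking operations, respectively.

```python
def ablate_policy(policy, evaluate, importance):
    """Remove the most important operation at each step, recording Top-1 error.

    policy: list of augmentation operation names
    evaluate: callable mapping a policy to a test error (hypothetical stand-in)
    importance: callable ranking operations (hypothetical stand-in)
    """
    remaining = list(policy)
    errors = [evaluate(remaining)]       # error of the full searched policy
    while remaining:
        remaining.remove(max(remaining, key=importance))  # drop the top operation
        errors.append(evaluate(remaining))
    return errors

# Toy stand-ins (the real study retrains models for each ablated policy):
policy = ["rotate", "shear", "color"]
imp = {"rotate": 0.9, "shear": 0.5, "color": 0.1}.get
err = lambda p: round(5.0 + 0.5 * (3 - len(p)), 2)  # error rises as ops are removed
curve = ablate_policy(policy, err, imp)
```

The resulting curve has one entry per ablation step, which is what a table like Tab. 1 reports.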



Unsupervised Data Augmentation for Consistency Training

Neural Information Processing Systems

Semi-supervised learning has lately shown much promise in improving deep learning models when labeled data is scarce. Common among recent approaches is the use of consistency training on a large amount of unlabeled data to constrain model predictions to be invariant to input noise.
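The core of consistency training can be illustrated with a small NumPy sketch: penalize the divergence between a model's predictions on a clean input and on a noised version of the same input. The function name and KL form here are illustrative of the general idea, not the paper's exact objective.

```python
import numpy as np

def consistency_loss(p_clean, p_noisy, eps=1e-12):
    """KL(p_clean || p_noisy): penalize prediction changes under input noise."""
    p_clean = np.asarray(p_clean, dtype=float)
    p_noisy = np.asarray(p_noisy, dtype=float)
    return float(np.sum(p_clean * (np.log(p_clean + eps) - np.log(p_noisy + eps))))

# Identical predictions incur zero loss; divergent ones are penalized.
same = consistency_loss([0.7, 0.3], [0.7, 0.3])
diff = consistency_loss([0.7, 0.3], [0.3, 0.7])
```

In practice this unsupervised term is added to the usual supervised loss on the labeled subset.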


Supervised Graph Contrastive Learning for Gene Regulatory Networks

Oshima, Sho, Okamoto, Yuji, Tosaki, Taisei, Kojima, Ryosuke, Okuno, Yasushi

arXiv.org Artificial Intelligence

Graph Contrastive Learning (GCL) is a powerful self-supervised learning framework that performs data augmentation through graph perturbations, with growing applications in the analysis of biological networks such as Gene Regulatory Networks (GRNs). The artificial perturbations commonly used in GCL, such as node dropping, induce structural changes that can diverge from biological reality. This concern has contributed to a broader trend in graph representation learning toward augmentation-free methods, which view such structural changes as problematic and to be avoided. However, this trend overlooks the fundamental insight that structural changes from biologically meaningful perturbations are not a problem to be avoided but a rich source of information, thereby ignoring the valuable opportunity to leverage data from real biological experiments. Motivated by this insight, we propose SupGCL (Supervised Graph Contrastive Learning), a new GCL method for GRNs that directly incorporates biological perturbations from gene knockdown experiments as supervision. SupGCL is a probabilistic formulation that continuously generalizes conventional GCL, linking artificial augmentations with real perturbations measured in knockdown experiments and using the latter as explicit supervisory signals. To assess effectiveness, we train GRN representations with SupGCL and evaluate their performance on downstream tasks. The evaluation includes both node-level tasks, such as gene function classification, and graph-level tasks on patient-specific GRNs, such as patient survival hazard prediction. Across 13 tasks built from GRN datasets derived from patients with three cancer types, SupGCL consistently outperforms state-of-the-art baselines. Graph representation learning has recently attracted attention in various fields to learn a meaningful latent space to represent the connectivity and attributes in given graphs (Ju et al., 2024). 
The application of graph representation learning to Gene Regulatory Networks (GRNs), which contain information about intracellular functions and processes, is particularly important in the fields of biology and drug discovery. It is expected to contribute to the identification of therapeutic targets and the elucidation of disease mechanisms. Representation learning for GRNs has been applied to tasks such as transcription factor inference (Yu et al., 2025) and predicting drug responses in cancer cell lines (Liu et al., 2022). Advances in gene expression measurement and analysis technologies have enabled the construction of patient-specific GRNs, highlighting gene regulation patterns that differ from the population as a whole (Nakazawa et al., 2021). Hereafter, this paper will refer to such individualized networks simply as GRNs.
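SupGCL's probabilistic objective is more general than a plain contrastive loss, but its contrastive building block can be illustrated with a minimal InfoNCE-style sketch: an embedding of a gene under the original GRN is pulled toward its embedding under a perturbed view (in SupGCL, a real knockdown perturbation) and pushed away from other genes. All names here are illustrative, not the paper's API.

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.5):
    """InfoNCE-style contrastive loss for one anchor embedding.

    anchor/positive: embeddings of a gene under the original vs. perturbed GRN.
    negatives: embeddings of other genes (shape [k, d]).
    """
    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
    logits = np.array([cos(anchor, positive) / tau]
                      + [cos(anchor, n) / tau for n in negatives])
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(-np.log(probs[0]))              # pull the positive pair together

a = np.array([1.0, 0.0])
loss_close = info_nce(a, np.array([1.0, 0.0]), np.array([[0.0, 1.0], [-1.0, 0.0]]))
loss_far = info_nce(a, np.array([0.0, 1.0]), np.array([[0.0, 1.0], [-1.0, 0.0]]))
```

A well-aligned positive view yields a lower loss than a misaligned one, which is the gradient signal a contrastive learner exploits.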





Augmentation-Based Deep Learning for Identification of Circulating Tumor Cells

Russo, Martina, Bertolini, Giulia, Cappelletti, Vera, De Marco, Cinzia, Di Cosimo, Serena, Paiè, Petra, Brancati, Nadia

arXiv.org Artificial Intelligence

Circulating tumor cells (CTCs) are crucial biomarkers in liquid biopsy, offering a noninvasive tool for cancer patient management. However, their identification remains particularly challenging due to their limited number and heterogeneity. Labeling samples for contrast limits the generalization of fluorescence-based methods across different hospital datasets. Analyzing single-cell images enables detailed assessment of cell morphology, subcellular structures, and phenotypic variations, often hidden in clustered images. Developing a method based on bright-field single-cell analysis could overcome these limitations. CTCs can be isolated using an unbiased workflow combining Parsortix technology, which selects cells based on size and deformability, with DEPArray technology, enabling precise visualization and selection of single cells. Traditionally, DEPArray-acquired digital images are manually analyzed, making the process time-consuming and prone to variability. In this study, we present a Deep Learning-based classification pipeline designed to distinguish CTCs from leukocytes in blood samples, aimed at enhancing diagnostic accuracy and optimizing clinical workflows. Our approach employs bright-field channel images acquired through DEPArray technology and leverages a ResNet-based CNN. To improve model generalization, we applied three types of data augmentation techniques and incorporated fluorescence (DAPI) channel images into the training phase, allowing the network to learn additional CTC-specific features. Notably, only bright-field images have been used for testing, ensuring the model's ability to identify CTCs without relying on fluorescence markers. The proposed model achieved an F1-score of 0.798, demonstrating its capability to distinguish CTCs from leukocytes. These findings highlight the potential of DL in refining CTC analysis and advancing liquid biopsy applications.
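Image augmentation of the kind used to improve generalization here can be sketched in a few lines of NumPy. The three operations below are common illustrative stand-ins (flip, rotation, mild noise); the paper does not detail its exact three techniques, so treat these as assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img):
    """Yield simple label-preserving variants of a single-cell image.

    Illustrative stand-ins, not the paper's exact augmentation set.
    """
    yield np.fliplr(img)                          # horizontal flip
    yield np.rot90(img)                           # 90-degree rotation
    yield img + rng.normal(0, 0.01, img.shape)    # mild Gaussian noise

img = rng.random((8, 8))                          # toy stand-in for a cell crop
variants = list(augment(img))
```

Each variant keeps the original spatial size, so augmented images can be fed to the same CNN input layer.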


UniCL: A Universal Contrastive Learning Framework for Large Time Series Models

Li, Jiawei, Peng, Jingshu, Li, Haoyang, Chen, Lei

arXiv.org Artificial Intelligence

Time-series analysis plays a pivotal role across a range of critical applications, from finance to healthcare, which involves various tasks, such as forecasting and classification. To handle the inherent complexities of time-series data, such as high dimensionality and noise, traditional supervised learning methods first annotate extensive labels for time-series data in each task, which is very costly and impractical in real-world applications. In contrast, pre-trained foundation models offer a promising alternative by leveraging unlabeled data to capture general time series patterns, which can then be fine-tuned for specific tasks. However, existing approaches to pre-training such models typically suffer from high-bias and low-generality issues due to the use of predefined and rigid augmentation operations and domain-specific data training. To overcome these limitations, this paper introduces UniCL, a universal and scalable contrastive learning framework designed for pretraining time-series foundation models across cross-domain datasets. Specifically, we propose a unified and trainable time-series augmentation operation to generate pattern-preserved, diverse, and low-bias time-series data by leveraging spectral information. Besides, we introduce a scalable augmentation algorithm capable of handling datasets with varying lengths, facilitating cross-domain pretraining. Extensive experiments on two benchmark datasets across eleven domains validate the effectiveness of UniCL, demonstrating its high generalization on time-series analysis across various fields.
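The idea of spectrum-based time-series augmentation can be illustrated with a fixed random jitter in the frequency domain; note that UniCL's augmentation operation is trainable, so this sketch captures only the "perturb the spectrum, preserve dominant patterns" intuition, and the function name is an assumption.

```python
import numpy as np

def spectral_augment(x, scale=0.1, seed=0):
    """Perturb a series in the frequency domain, preserving dominant patterns.

    A fixed-jitter sketch of spectral augmentation; UniCL learns the operation.
    """
    rng = np.random.default_rng(seed)
    spec = np.fft.rfft(x)                              # real-input FFT
    jitter = 1.0 + scale * rng.standard_normal(spec.shape)
    return np.fft.irfft(spec * jitter, n=len(x))       # back to the time domain

t = np.linspace(0, 1, 128, endpoint=False)
x = np.sin(2 * np.pi * 3 * t)                          # toy series with one pattern
x_aug = spectral_augment(x)
```

The augmented series has the same length and keeps the dominant frequency, which is the "pattern-preserved, low-bias" property the abstract refers to.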


Data augmentation with automated machine learning: approaches and performance comparison with classical data augmentation methods

Mumuni, Alhassan, Mumuni, Fuseini

arXiv.org Artificial Intelligence

Data augmentation is arguably the most important regularization technique commonly used to improve generalization performance of machine learning models. It primarily involves the application of appropriate data transformation operations to create new data samples with desired properties. Despite its effectiveness, the process is often challenging because of the time-consuming trial and error procedures for creating and testing different candidate augmentations and their hyperparameters manually. Automated data augmentation methods aim to automate the process. State-of-the-art approaches typically rely on automated machine learning (AutoML) principles. This work presents a comprehensive survey of AutoML-based data augmentation techniques. We discuss various approaches for accomplishing data augmentation with AutoML, including data manipulation, data integration and data synthesis techniques. We present extensive discussion of techniques for realizing each of the major subtasks of the data augmentation process: search space design, hyperparameter optimization and model evaluation. Finally, we carry out an extensive comparison and analysis of the performance of automated data augmentation techniques and state-of-the-art methods based on classical augmentation approaches. The results show that AutoML methods for data augmentation currently outperform state-of-the-art techniques based on conventional approaches.
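The three subtasks the survey names (search space design, hyperparameter optimization, model evaluation) can be illustrated with the simplest possible automated search: random search over augmentation policies. The operation names, score function, and loop are hypothetical stand-ins for the far stronger optimizers (e.g. RL or Bayesian search) that actual AutoML systems use.

```python
import random

# Search space design: candidate operations and a magnitude range.
OPS = ["rotate", "shear", "translate", "color", "cutout"]

def random_policy(rng, n_ops=2):
    """Sample a candidate policy: (operation, magnitude) pairs."""
    return [(rng.choice(OPS), round(rng.uniform(0.1, 0.9), 2)) for _ in range(n_ops)]

def search(evaluate, trials=20, seed=0):
    """Hypothetical random-search loop: keep the policy with the best score.

    evaluate: callable scoring a policy on a validation set (assumed given).
    """
    rng = random.Random(seed)
    best_policy, best_score = None, float("-inf")
    for _ in range(trials):
        policy = random_policy(rng)         # hyperparameter sampling
        score = evaluate(policy)            # model evaluation (stand-in)
        if score > best_score:
            best_policy, best_score = policy, score
    return best_policy, best_score

# Toy score: prefer larger magnitudes (a stand-in for validation accuracy).
best_policy, best_score = search(lambda p: sum(m for _, m in p))
```

Replacing the sampler with a learned controller and the toy score with real validation accuracy recovers the structure of the AutoML approaches the survey compares.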